Abstract

We investigate a biologically motivated approach to fast visual
classification, directly inspired by the recent work [20].
Specifically, trading off biological accuracy for computational
efficiency, we explore using wavelet and grouplet-like transforms to
parallel the tuning of visual cortex V1 and V2 cells, alternated with
max operations to achieve scale and translation invariance. A feature
selection procedure is applied during learning to accelerate
recognition. We introduce a simple attention-like feedback mechanism,
significantly improving recognition and robustness in multiple-object
scenes. In experiments, the proposed algorithm achieves or exceeds
state-of-the-art success rates on object recognition, texture and
satellite image classification, language identification and sound
classification.

1 Introduction 

Automatic object recognition and image classification are important
and challenging tasks. This paper is inspired by the remarkable recent
work of Poggio, Serre, and their colleagues [20] on rapid object
categorization using a feedforward architecture closely modeled on the
human visual system. It departs from that work in two main directions.
First, trading off biological accuracy for computational efficiency,
our approach exploits more engineering-motivated mathematical tools
such as wavelet and grouplet transforms [13, 12], allowing faster
computation and limiting ad-hoc parameters. Second, the approach is
generalized by adding a degree of feedback (another known component of
human perception), yielding significant performance and robustness
improvements in multiple-object scenes. In experiments, the resulting
scale- and translation-invariant algorithm achieves or exceeds
state-of-the-art performance in object recognition, and also in
texture and satellite image classification and in language
identification.



2 Algorithm description 

2.1 Feature computation and classification 

As in [20], the algorithm is hierarchical. In addition, motivated in
part by the relative uniformity of cortical anatomy [14, 21], the two
layers of the hierarchy are made to be computationally similar, as
shown in Fig. 1. Layer one performs a wavelet transform [13] in the S1
unit followed by a local maximum operation in the C1 unit. The
transform in the S2 unit in layer two is similar to the grouplet
transform [12], and is followed by a global maximum operation in the
C2 unit.

Figure 1. Algorithm overview.

S1: Wavelet transform. The frequency and orientation tuning of cells
in visual cortex V1 can be interpreted as performing a wavelet
transform of the retinal image [13]. Let f(x, y) denote a gray-level
image of size $N_1 \times N_2$. A translation-invariant wavelet
transform is performed on the image:
where k = 1, 2, 3 denotes the orientation (horizontal, vertical,
diagonal), $\psi^k(x, y)$ is a wavelet function and $Wf$ are the
wavelet coefficients. Scale invariance is achieved by a normalization

$$S_1(u, v, j, k) = \frac{|Wf(u, v, j, k)|}{\|f\|_{\mathrm{supp}(\psi_j^k)}}, \qquad (2)$$

where $\|f\|_{\mathrm{supp}(\psi_j^k)}$ is the image energy within the
support of the wavelet $\psi^k\left(\frac{x-u}{2^j}, \frac{y-v}{2^j}\right)$.
One can verify that

$$S_1(u, v, j, k) = S_1'(2^\beta u, 2^\beta v, \beta + j, k), \qquad (3)$$

where $S_1$ and $S_1'$ are the coefficients of f(x, y) and of its
$2^\beta$-times zoomed version $f(x/2^\beta, y/2^\beta)$. The
normalization also makes the recognition invariant to global linear
illumination change.
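The invariance (3) can be checked by a change of variables. The short derivation below is only a sketch: it assumes that the wavelet coefficients take the usual inner-product form with a $2^{-j}$ normalization, and that the norm in (2) is the $L^2$ norm of f restricted to the wavelet support. Neither assumption is spelled out in the text above.

```latex
% Assumed form of the wavelet coefficients (the 2^{-j} factor is an assumption):
Wf(u,v,j,k) = \iint f(x,y)\, 2^{-j}\,
    \psi^k\!\left(\tfrac{x-u}{2^j}, \tfrac{y-v}{2^j}\right) dx\, dy .

% Zoomed image f'(x,y) = f(x/2^\beta, y/2^\beta);
% substitute x = 2^\beta x', y = 2^\beta y':
Wf'(2^\beta u, 2^\beta v, j+\beta, k)
  = \iint f(x',y')\, 2^{-(j+\beta)}\,
    \psi^k\!\left(\tfrac{x'-u}{2^j}, \tfrac{y'-v}{2^j}\right) 2^{2\beta}\, dx'\, dy'
  = 2^\beta\, Wf(u,v,j,k) .

% The L^2 norm over the (zoomed) wavelet support scales the same way:
\|f'\|_{\mathrm{supp}(\psi^k_{j+\beta})}
  = \Big( 2^{2\beta} \iint_{\mathrm{supp}(\psi^k_j)} f(x',y')^2\, dx'\, dy' \Big)^{1/2}
  = 2^\beta\, \|f\|_{\mathrm{supp}(\psi^k_j)} .

% Hence the ratio is unchanged:
S_1'(2^\beta u, 2^\beta v, j+\beta, k)
  = \frac{2^\beta\, |Wf(u,v,j,k)|}{2^\beta\, \|f\|_{\mathrm{supp}(\psi^k_j)}}
  = S_1(u,v,j,k) .
```

Under the same assumptions, replacing f by a f with a > 0 multiplies numerator and denominator by a, which is the invariance to a global linear illumination change.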

C1: Local maximum. Limited translation invariance is achieved at this
stage by keeping the local maximum of the S1 coefficients in a
subsampling procedure:

$$C_1(u, v, j, k) = \max_{u' \in [2^j(u-1)+1,\, 2^j u],\; v' \in [2^j(v-1)+1,\, 2^j v]} S_1(u', v', j, k), \qquad (4)$$

the maximum being taken at each scale j and orientation k within a
spatial neighborhood of size proportional to $2^j \times 2^j$. The
resulting C1 map at scale j and orientation k is thus of size
$N_1/2^j \times N_2/2^j$.
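The local maximum with subsampling can be sketched in a few lines of NumPy. This is an illustration rather than the authors' implementation: the function name `c1_local_max`, the (rows, columns, orientations) array layout and the handling of incomplete border blocks are assumptions made here.

```python
import numpy as np

def c1_local_max(s1_scale, j):
    """Max-pool an S1 map of shape (N1, N2, K) at scale j over
    non-overlapping 2^j x 2^j blocks, giving shape (N1/2^j, N2/2^j, K)."""
    step = 2 ** j
    n1, n2, k = s1_scale.shape
    m1, m2 = n1 // step, n2 // step        # assumption: drop incomplete border blocks
    blocks = s1_scale[:m1 * step, :m2 * step].reshape(m1, step, m2, step, k)
    return blocks.max(axis=(1, 3))         # maximum inside each block, per orientation

rng = np.random.default_rng(0)
s1 = rng.random((16, 12, 3))               # toy S1 coefficients at scale j = 2
c1 = c1_local_max(s1, j=2)                 # shape (4, 3, 3)
```

Each output coefficient summarizes one $2^j \times 2^j$ block, so a shift of the input that stays within a block leaves the output unchanged.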

S2: Grouplet-like transform. Cells in visual cortex V2 and V4 have
larger receptive fields than those in V1 and are tuned to
geometrically more complex stimuli such as contours and corners [19].
The geometrical grouplets recently proposed by Mallat [12] imitate
this mechanism by grouping and re-transforming the wavelet
coefficients.

The procedure in S2 is similar to the grouplet transform. Instead of
grouping the wavelet coefficients with a multiscale geometrically
adaptive association field and then re-transforming them with
Haar-like functions as in [12], responses of S2 are obtained via inner
products between C1 coefficients and sliding patch functions of
different sizes:

$$S_2(u, v, j, i) = \sum_{u'=1}^{N_1/2^j} \sum_{v'=1}^{N_2/2^j} \sum_{k=1}^{3} C_1(u', v', j, k)\, P_i(u' - u, v' - v, k), \qquad (5)$$

where the patch functions $P_i$, of support size
$M_i \times M_i \times 3$, group the 3 wavelet orientations in a
square of size $M_i \times M_i$.

While the grouplet functions are adaptively chosen to fit the geometry
in the image [12], the patch functions $P_i$, i = 1, ..., N, are
learned with a simple random sampling as in [20]: each patch is
extracted at a random scale and a random position from the C1
coefficients of a randomly selected training image, the rationale
being that patterns that appear with high probability are likely to be
learned.
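Random patch sampling and the sliding inner product of S2 might be sketched as follows. The helper names `sample_patch` and `s2_response` are hypothetical, and the sketch assumes that responses are computed only where the patch lies fully inside the C1 map; the text above does not specify the border handling.

```python
import numpy as np

def sample_patch(c1_maps, size, rng):
    """Cut one size x size x 3 patch at a random scale and position.
    c1_maps: list (one entry per scale) of arrays of shape (H_j, W_j, 3)."""
    candidates = [m for m in c1_maps if min(m.shape[:2]) >= size]
    c1 = candidates[rng.integers(len(candidates))]      # random scale
    u = rng.integers(c1.shape[0] - size + 1)            # random position
    v = rng.integers(c1.shape[1] - size + 1)
    return c1[u:u + size, v:v + size, :].copy()

def s2_response(c1, patch):
    """Inner product of the patch with every window of C1 fully inside the map."""
    windows = np.lib.stride_tricks.sliding_window_view(c1, patch.shape)
    # windows has shape (H - M + 1, W - M + 1, 1, M, M, 3)
    return np.einsum("uvzabk,abk->uv", windows, patch)

rng = np.random.default_rng(0)
c1_maps = [rng.random((32, 32, 3)), rng.random((16, 16, 3)), rng.random((8, 8, 3))]
patch = sample_patch(c1_maps, size=4, rng=rng)
s2 = s2_response(c1_maps[1], patch)                     # shape (13, 13)
```

In practice the patch would be sampled from the C1 maps of a randomly chosen training image and then applied to the C1 maps of every image at every scale.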

C2: Global maximum. A global maximum operation in space and in scale
is applied on S2, and the resulting C2 coefficients

$$C_2(i) = \max_{u, v, j} S_2(u, v, j, i) \qquad (6)$$

are thus invariant to image translation and scale change.
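The global maximum over position and scale reduces each patch response to a single number. The sketch below uses hypothetical helper names; the second function also returns where the maximum is attained, which is an addition made here because the feedback stage later relies on the positions of the C2 coefficients.

```python
import numpy as np

def c2_feature(s2_maps):
    """Global maximum over positions (u, v) and scales j for one patch.
    s2_maps: list of 2-D S2 response arrays, one per scale."""
    return max(float(m.max()) for m in s2_maps)

def c2_with_position(s2_maps):
    """Same maximum, plus the (scale, u, v) index where it is attained."""
    best = max(range(len(s2_maps)), key=lambda j: s2_maps[j].max())
    u, v = np.unravel_index(np.argmax(s2_maps[best]), s2_maps[best].shape)
    return float(s2_maps[best][u, v]), (best, int(u), int(v))

s2_maps = [np.array([[0.1, 0.7], [0.2, 0.3]]), np.array([[0.9, 0.4]])]
value, where = c2_with_position(s2_maps)
shifted = [np.roll(s2_maps[0], 1, axis=1), s2_maps[1]]   # translate one response map
```

Because only the maximum is kept, moving a response within a map (as in `shifted`) or between scales does not change the feature value.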



Classification. The classification uses the C2 coefficients as
features and thus inherits the translation and scale invariance. While
various classifiers such as SVMs can be used, a simple but robust
nearest neighbor classifier is applied in the experiments.
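A nearest neighbor classifier on C2 feature vectors can be written compactly. The sketch assumes a Euclidean distance, which the text does not state, and the function name is hypothetical.

```python
import numpy as np

def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test vector with the label of its closest training vector
    (squared Euclidean distance; the metric is an assumption)."""
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]

train_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])   # toy C2 vectors
train_y = np.array([0, 0, 1])
test_x = np.array([[0.9, 1.2], [4.0, 6.0]])
pred = nearest_neighbor_predict(train_x, train_y, test_x)
```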

2.2 Feature selection 

Structures that appear with a high probability are likely to be
learned as patch functions through random sampling. However, they are
not necessarily salient, and neither are the resulting C2 features.
This suggests active selection of the learned patches.¹ For example,
Lowe and Mutch have constructed sparse patches by retaining one
salient direction at each position [18].

A simple patch selection is proposed here by sorting the variances of
the C2 coefficients of the training images. A small $C_2(i)$ variance
implies that the corresponding patch $P_i$ is not salient. Fig. 2-a
plots the variance of the C2 coefficients of the motorcycle and the
background images in the Caltech5 database (see Fig. 4), the S2
patches being learned from the same images. Out of the 1000 patches,
the 200 salient ones whose resulting C2 coefficients have
non-negligible variance are selected. The other patches usually
correspond to nonsalient structures such as a common background and
are therefore excluded. Fig. 2-b and c show that after patch selection
the 200 C2 coefficients are mainly positioned around the object, as
opposed to the 1000 C2 coefficients spread over the whole image prior
to patch selection. Recognition using these salient patches is not
only more robust but also 5 times faster.
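The variance-based selection amounts to ranking patches by the variance of their C2 coefficient across the training images. A minimal sketch, with a hypothetical function name and a toy feature matrix:

```python
import numpy as np

def select_salient_patches(c2_train, n_keep):
    """c2_train: array of shape (n_images, n_patches).
    Returns the indices of the n_keep patches with the largest C2 variance
    (in decreasing order of variance) and the variances themselves."""
    variances = c2_train.var(axis=0)
    order = np.argsort(variances)[::-1]
    return order[:n_keep], variances

c2_train = np.array([[0.5, 0.1, 0.9],
                     [0.5, 0.8, 0.1],
                     [0.5, 0.2, 0.8]])     # patch 0 responds identically everywhere
selected, variances = select_salient_patches(c2_train, n_keep=2)
```

With the configuration described above, `n_keep` would be 200 out of 1000 patches; only the selected columns then need to be computed at recognition time.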




Figure 2. a. Variance of the C2 coefficients before patch selection.
b and c. Positions of the C2 coefficients before and after patch
selection (marked by crosses).

2.3 Feedback 

Feedback [Q33 01 [221 allows tracing back object posi- 
tions, focusing attention on the objects one by one and 
thus improving recognition performance in multiple-object 
scenes. 

Object positioning. For simplicity the feedback procedure is discussed
in a two-object scene, but it can be applied in the case of multiple
objects. C2 coefficients are placed around the two objects after
selection, as shown in Fig. 3-a. Using a clustering algorithm such as
the K-means algorithm, one is able to locate the two objects, as
illustrated in Fig. 3-b.

1 Besides improving the computational efficiency of the algorithm,
such reorganization is inspired both by a similar process thought to
occur after immediate learning, notably during sleep, and by the
relative uniformity of cortical anatomy [14], which suggests enhancing
computational similarity between the two layers.

Object identification. While one could recalculate the features of the
attended object cropped out from the whole image, i.e., concentrate
all the visual cortex resources on a single object, a faster procedure
identifies the attended object, say object A, directly using the
lower-dimensional feature vector C2A, composed of the C2 coefficients
corresponding to A already calculated in the feedforward pathway. This
can be implemented by reclassifying C2A using subsets of the C2
coefficients of the training images extracted at the same coordinates
as C2A, as shown in Fig. 3-c. Discarding the coordinates that are
located on the irrelevant object B in the test image disambiguates the
classification and improves the recognition of object A.
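The two feedback steps, clustering the positions of the C2 coefficients and reclassifying with the coordinates of one cluster only, might be sketched as below. The plain Lloyd iteration, the Euclidean nearest neighbor rule and all names are assumptions made for illustration; the toy data are constructed so that the full feature vector is ambiguous while the attended subset is not.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations on 2-D positions; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):          # keep the old center if a cluster empties
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

def classify_attended(c2_test, labels, cluster, train_x, train_y):
    """Nearest neighbor restricted to the C2 coordinates of one cluster."""
    idx = np.flatnonzero(labels == cluster)
    d2 = ((train_x[:, idx] - c2_test[idx]) ** 2).sum(axis=1)
    return train_y[np.argmin(d2)]

a = np.array([[10, 10], [11, 10], [10, 11], [12, 11], [11, 12]], dtype=float)
positions = np.vstack([a, a + 100.0])        # C2 positions around objects A and B
labels, centers = kmeans(positions, k=2)
train_x = np.vstack([np.zeros(10), np.ones(10)])
train_y = np.array([0, 1])
c2_test = np.concatenate([np.ones(5), np.zeros(5)])   # A looks like class 1, B like class 0
label_a = classify_attended(c2_test, labels, labels[0], train_x, train_y)
```

No features are recomputed: the same C2 values from the feedforward pass are reused, and only the subset of coordinates entering the distance changes.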





Figure 3. Feedback in a two-object scene. a. Positions of C2
coefficients are marked by crosses. b. C2 coefficients are clustered
(represented by circles vs. crosses). c. Feature coefficients of the
training images are grouped, the coordinates being in line with the
clustering of the coefficients of the test image. Rectangles and
ellipses represent the two groups.

3 Experiments 

All the experimental results were obtained with the same algorithm
configuration. Daubechies 7-9 wavelets at 3 scales were used in S1. In
S2, 1000 patches $P_i$ of 4 different sizes M x M x 3 with
M = 4, 8, 12, 16 (250 patches of each size) were learned from the
training images. The classifier was the simple nearest neighbor
classification algorithm.

For texture and satellite image classification as well as for language
identification, one sample image of size 512 x 512 was available per
image class and was segmented into 16 non-overlapping parts of size
128 x 128. Half were used for training and the rest for testing.
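The segmentation of a 512 x 512 sample into 16 tiles of 128 x 128 can be illustrated as follows. Which tiles go to training is not specified above; the alternating split used here is an assumption, as is the function name.

```python
import numpy as np

def split_into_tiles(image, tile=128):
    """Cut a 2-D image into non-overlapping tile x tile parts, row by row."""
    h, w = image.shape
    rows, cols = h // tile, w // tile
    tiles = (image[:rows * tile, :cols * tile]
             .reshape(rows, tile, cols, tile)
             .swapaxes(1, 2))                      # (rows, cols, tile, tile)
    return tiles.reshape(rows * cols, tile, tile)

image = np.arange(512 * 512, dtype=np.float64).reshape(512, 512)   # stand-in sample image
tiles = split_into_tiles(image)                    # 16 tiles of 128 x 128
train, test = tiles[0::2], tiles[1::2]             # assumption: alternate tiles
```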

3.1 Object recognition 

For the object recognition experiments we used 4 data sets, namely
airplanes, motorcycles, cars (rear) and leaves, plus a background
class, from the Caltech5 database,² some sample images being shown in
Fig. 4. The images are converted to gray level and rescaled,
preserving the aspect ratio, so that the minimum side length is 140
pixels. A set of 50 positive images and 50 negative images was used
for training and another set for testing.



Table 1 summarizes the object recognition results. The performance
measure reported is the ROC accuracy.³ Results obtained with the
proposed algorithm are superior to previous approaches El [24] and
comparable to [20], but at a lower computational cost (in Matlab code,
about 6 times faster with feature selection). Fig. 5-d shows that the
performance improves as the number of C2 features increases and is in
general stable with 200 features.
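The ROC accuracy used as the performance measure is defined in the paper's footnote as R = 1 - ((1 - p)x + p(1 - y)), with x the false positive rate, y the true positive rate and p the proportion of positive samples. A direct transcription (the function name is chosen here for illustration):

```python
def roc_accuracy(fpr, tpr, p):
    """R = 1 - ((1 - p) * x + p * (1 - y)) with x = fpr, y = tpr,
    and p the proportion of positive samples."""
    return 1.0 - ((1.0 - p) * fpr + p * (1.0 - tpr))

# Balanced test set, 10% false positives and 90% true positives.
r = roc_accuracy(fpr=0.1, tpr=0.9, p=0.5)
```

With a balanced test set (p = 0.5), R is simply the average of the true positive rate and the true negative rate.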



Figure 4. Sample images from Caltech5. From left 
to right: airplanes, motorcycles, cars (rear), leaves and 
background. 
Table 1. Object recognition performance.




Figure 5. a-c. Pairs of textures. d. Performance vs. number of C2
features.

3.2 Texture classification 

Figs. 5-a,b,c and Fig. 6 show respectively 3 pairs of textures that
were used for binary classification and a group of 10 textures that
were used for multiple-class (10-class) classification, all from the
Brodatz database.⁴ As summarized in Table 2, the proposed algorithm
achieved perfect results for binary classification, and for the
challenging multiple-class classification its performance was
comparable to the state-of-the-art methods [11]. Indeed, the random
patch extraction applied in the algorithm is ideal for classifying
stationary patterns such as textures. Fig. 5 shows that stable
performance is achieved with as few as 40 features, which confirms the
good texture classification results and the robustness of the
algorithm.



2 http://www.robots.ox.ac.uk/~vgg/data3.html

3 ROC accuracy: R = 1 - ((1 - p)x + p(1 - y)), where x and y are
respectively the false positive rate on the negative samples and the
true positive rate on the positive samples, and p is the proportion of
positive samples.

4 Fig. 5-a,b,c: D4 and D84; D12 and D17; D5 and D92. Fig. 6: D4, D9,
D19, D21, D24, D28, D29, D36, D37, D38.




Figure 6. A group of 10 textures. 
Table 2. Texture classification performance. From left to right:
proposed method, average and best performance of the algorithms
summarized in [17], other methods. N/A means that the results were not
shown.



Classifying the whole Brodatz database (111 textures) is a more
challenging task. Combining C2 coefficients with the histogram of the
wavelet approximation coefficients as features, the proposed algorithm
achieved 87.8% accuracy for the 111-texture classification, comparable
to the 88.2% accuracy reported in [7], obtained with a
state-of-the-art texture classification approach.

3.3 Satellite image classification 

Fig. 7 displays 4 classes of satellite images at 0.5 m resolution:
urban areas, rural areas, forests and sea. Since access to images at
other resolutions is restricted, we simulated the images at
resolutions of 1 m and 2 m by Gaussian convolution and subsampling.
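Simulating a coarser resolution by Gaussian convolution followed by subsampling can be sketched with NumPy alone. The kernel width (sigma equal to half the subsampling factor), the reflective border handling and the function names are assumptions; the paper does not give these details.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def simulate_lower_resolution(image, factor, sigma=None):
    """Blur with a separable Gaussian, then keep every factor-th pixel."""
    if sigma is None:
        sigma = 0.5 * factor                      # hypothetical choice of blur width
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image, pad, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    blurred = np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")
    return blurred[::factor, ::factor]

image = np.full((64, 64), 3.0)                    # toy constant image
half = simulate_lower_resolution(image, factor=2) # e.g. 0.5 m -> 1 m
```

Applying the function with factors 2 and 4 to a 0.5 m image would give stand-ins for the 1 m and 2 m images used in the scale-invariance test.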

The first experiment tested the multi-class classification of the
mono-resolution images shown in Fig. 7. 100% classification accuracy
was achieved for images of all 4 classes. The second experiment
validated the scale invariance of the proposed algorithm. Images at
resolution 0.5 m were used to train the classifier, while the
classification was tested on images at resolutions 1 m and 2 m. Again
the classification accuracy was 100%, the same as reported in a recent
work [11] and significantly higher than the earlier methods referenced
therein. In addition, image resolution is assumed to be known in [11],
whereas the proposed algorithm does not need this information, thanks
to its scale invariance.




Figure 7. Satellite images. From left to right: forest, 
urban areas, rural areas, sea. 



3.4 Language identification 

Language identification aims to determine the underlying language of a
document in an imaged format, and is often carried out as a
preprocessing step for optical character recognition (OCR). Based on
principles totally different from traditional approaches [10], the
proposed algorithm achieved a 100% success rate in an 8-language
identification task, as shown in Fig. 8.



Figure 8. From top to bottom, left to right: document texts in Arabic,
Chinese, English, Greek, Hebrew, Japanese, Korean, Russian.



3.5 Sound Classification 

The main idea in extending the above algorithm to sound applications
is to view time-frequency representations of sound as textures.
Preliminary experiments suggest this may be a fruitful direction of
research.

Fig. 9 illustrates 5 types of sounds and samples of their
log-spectrograms. Two-minute excerpts of each sound were collected.
The spectrograms were segmented (in time) into segments of 5 seconds.
Half were used for training and the rest for testing. A direct
application of the proposed algorithm using the spectrograms as the
visual patterns resulted in 100% accuracy in the 5-sound
classification.
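Turning a sound into texture-like images can be sketched as a log-spectrogram followed by a split along time. The frame length, hop size, Hann window, sampling rate and the synthetic test tone below are all assumptions made for illustration; the paper does not state its spectrogram parameters.

```python
import numpy as np

def log_spectrogram(signal, frame=512, hop=256):
    """Log power spectrogram, shape (frame // 2 + 1, n_frames)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10).T                 # small offset avoids log(0)

def segment_in_time(spec, frames_per_segment):
    """Cut the spectrogram into consecutive, non-overlapping time segments."""
    n = spec.shape[1] // frames_per_segment
    return [spec[:, i * frames_per_segment:(i + 1) * frames_per_segment]
            for i in range(n)]

sr = 8000                                          # hypothetical sampling rate (Hz)
t = np.arange(20 * sr) / sr                        # 20 s synthetic excerpt
signal = np.sin(2 * np.pi * 440.0 * t)             # pure tone standing in for a recording
spec = log_spectrogram(signal)
segments = segment_in_time(spec, frames_per_segment=5 * sr // 256)   # about 5 s each
train, test = segments[0::2], segments[1::2]       # assumption: alternate segments
```

Each segment is then treated exactly like a gray-level image by the S1/C1/S2/C2 pipeline.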

3.6 Feedback: multiple-object scenes 



Recognition performance tends to degrade when multiple stimuli are
presented in the receptive field. Fig. 10 shows an example of a
multiple-object scene in which one searches for an object, say an
airplane, through a binary classification against a background image.
Due to the perturbation from the coexisting stimuli, the feedforward
recognition accuracy is as low as 74%. The feedback procedure
introduced in Subsection 2.3 considerably improves the accuracy, to
98%, by focusing attention on each object in turn.




Figure 9. From left to right: violin, piano, trumpet, 
flute, speech. The figures in the second row are spectro- 
grams of the sounds illustrated in the first row. 




Figure 10. a. A 4-object scene. b. C2 coefficients clustering.



4 Conclusion and future work 

Inspired by the biologically motivated work of [20], we have described
a wavelet-based algorithm which can compete with the state-of-the-art
methods for fast and robust object recognition, texture and satellite
image classification, language identification and sound
classification. A feedback procedure has been introduced to improve
recognition performance in multiple-object scenes.

Potential applications also include video archiving (semantic video
analysis), video surveillance, high-throughput drug development,
texture retrieval, and robotic learning by imitation.

To further improve and extend the algorithm, a key aspect will be a
more refined use of feedback between different levels. Such feedback
will naturally involve stability and convergence questions, which will
in turn both guide the design of the algorithm and shape its
performance. In addition, contrary to the nervous system, the
algorithm need not be constrained by information transmission delays
between different levels. Preliminary ideas in this direction are
briefly discussed in the appendix.

APPENDIX 

A Dynamic System Perspective 
A.0.1 Basic algorithm 

The first step towards introducing a dynamic systems perspective 
aimed at further development of feedback mechanisms is simply 
to rewrite the algorithm in terms of differential equations, which 
puts it in a form more suitable to subsequent analysis of stability 
and convergence. 

Let $x_1$ be the output of the S1/C1 layer, and $x_2$ be the output of
the S2/C2 layer (in our static implementation above, we simply have
$x_1 = C_1$ and $x_2 = C_2$).

For a single object, the basic algorithm can be trivially computed by
a dynamic system of the form

$$\dot{x}_1 = -k_1 (x_1 - C_1)$$

$$\dot{x}_2 = -k_2 (x_2 - C_2)$$

For multiple objects, the clustering process described in Section 2.3
can be implemented by introducing a scalar state $x_3$, which spikes
for each object in sequence

$$\dot{x}_3 = p(x_3, C_2)$$

with spike amplitude equal to 1; the function $p(x_3, C_2)$ is
discussed later in this section and in a companion paper. The dynamics
of $x_2$ can be modified in turn so that states corresponding to each
object appear in sequence according to the state $x_3$

$$\dot{x}_1 = -k_1 (x_1 - C_1)$$

$$\dot{x}_2 = -k_2 (x_2 - x_3 \Psi(C_{2A}))$$

$$\dot{x}_3 = p(x_3, C_2)$$

where, componentwise, $\Psi(C_{2A}) = C_{2A}$ where $C_{2A}$ is active
and $\Psi(C_{2A}) = 0$ otherwise. Note that $x_3$ smoothly transitions
between 0 and 1 according to the attended object. The positive gain
$k_2$ is chosen such that $k_2 T \gg 1$, where T is the spike
duration, itself a fraction of the interspike period.

The above equations simply implement the basic algorithm and display
objects in sequence, without introducing any new feature at this
point.
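The behavior of the gated dynamics for the S2/C2 state can be illustrated with a forward-Euler simulation. This sketch replaces the spiking state by a prescribed on/off gate of duration T per object instead of integrating its own dynamics, which are deferred to the companion paper; the gain, step size and function name are likewise assumptions.

```python
import numpy as np

def simulate_attention(c2, masks, k2=50.0, T=1.0, dt=1e-3):
    """Integrate x2' = -k2 * (x2 - x3 * target), where target equals C2 on the
    attended object's coordinates and 0 elsewhere, and x3 is a prescribed
    0/1 gate (a stand-in for the spiking state)."""
    x2 = np.zeros_like(c2, dtype=float)
    snapshots = []
    steps = int(round(T / dt))
    for mask in masks:
        target = np.where(mask, c2, 0.0)
        for x3 in (1.0, 0.0):                  # spike of duration T, then rest
            for _ in range(steps):
                x2 += dt * (-k2 * (x2 - x3 * target))
            if x3 == 1.0:
                snapshots.append(x2.copy())    # state at the end of the spike
    return snapshots

c2 = np.array([1.0, 2.0, 3.0, 4.0])
masks = [np.array([True, True, False, False]),   # coordinates of object A
         np.array([False, False, True, True])]   # coordinates of object B
snapshots = simulate_attention(c2, masks)
```

With a gain-duration product of 50, the state settles on the attended object's coefficients well within each spike, which is the role of the condition that this product be much larger than 1.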

Techniques for globally stable spike-based clustering are described in
a companion paper, based on modified FitzHugh-Nagumo neural
oscillators [3, 15], similar to [1],

$$\dot{v}_i = 3 v_i - v_i^3 + 2 - w_i + I_i \qquad (7)$$

$$\dot{w}_i = c\,[\alpha (1 + \tanh(\beta v_i)) - w_i]$$

where $v_i$ is the membrane potential of the oscillator, $w_i$ is an
internal state variable representing gate voltage, $I_i$ represents
the external current input, and $\alpha$, $\beta$ and c are strictly
positive constants. Using a diagonal metric transformation
$\Theta = \mathrm{diag}(\sqrt{c\alpha\beta}, 1)$, one easily shows,
similarly to [23], that

$$\Theta J \Theta^{-1} \le \mathrm{diag}\left(3 + \frac{\alpha\beta}{4},\, 0\right)$$

where J is the Jacobian matrix of (7), leading to simple global
stability conditions based on [16] (Section 2.2).
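A bound of this kind can be checked numerically. The sketch below reads the oscillator as v' = 3v - v^3 + 2 - w + I, w' = c[alpha(1 + tanh(beta v)) - w], takes the metric diag(sqrt(c alpha beta), 1), and interprets the matrix inequality as a bound on the symmetric part of the transformed Jacobian by diag(3 + alpha beta / 4, 0). These readings, the parameter values and the function names are assumptions made here, not statements taken from the companion paper.

```python
import numpy as np

def generalized_jacobian_sym(v, alpha, beta, c):
    """Symmetric part of Theta J Theta^{-1} at membrane potential v."""
    J = np.array([[3.0 - 3.0 * v ** 2, -1.0],
                  [c * alpha * beta / np.cosh(beta * v) ** 2, -c]])
    theta = np.diag([np.sqrt(c * alpha * beta), 1.0])
    F = theta @ J @ np.linalg.inv(theta)
    return 0.5 * (F + F.T)

def bound_holds(alpha, beta, c, vs, tol=1e-9):
    """Check that diag(3 + alpha*beta/4, 0) minus the symmetric part is
    positive semidefinite for every v in vs."""
    B = np.diag([3.0 + alpha * beta / 4.0, 0.0])
    return all(np.linalg.eigvalsh(B - generalized_jacobian_sym(v, alpha, beta, c)).min() >= -tol
               for v in vs)

ok = bound_holds(alpha=2.0, beta=1.5, c=0.5, vs=np.linspace(-5.0, 5.0, 201))
```

The reason the check passes is that the off-diagonal entry of the symmetric part has magnitude at most sqrt(c alpha beta)/2, so completing the square against the -c diagonal entry adds at most alpha beta / 4 to the first diagonal entry.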



A.1 Generalized diffusive connections

One of the most immediate additional feedback mechanisms to be
explored is that of generalized diffusive connections ([16], Section
3.1.2). In a feedback hierarchy, these correspond to achieving
consensus between multiple processes of different dimensions.

A.2 Tracking of time-varying images 

Similarly to [9], composite variables for dynamic tracking can be used
at every level, based on both top-down and bottom-up information. This
allows one to implicitly introduce time derivatives of signals in the
differential equations, without having to measure or compute these
terms explicitly.